Charlottesville bars are looking for ways to increase their attendance next semester. One area they have selected is the playlist of songs for their bar: if they can find songs that people are more likely to dance to, this could increase their attendance and popularity. In this project we examine a dataset of songs and build machine learning models to predict a danceability rating for them. We use three model types: k-nearest neighbors (KNN), decision tree, and random forest. For each model type, we created an initial model and then tuned it to best predict our data. We aim to predict whether a particular song falls in the top 25% of danceability ratings and would therefore have optimal qualities for a bar-night playlist. Two metrics we focus on in this project are specificity and the F1 score. Specificity matters in this situation because songs classified as top-25% danceable when they are not could be detrimental to the atmosphere of a bar and turn away bar-goers. The F1 score is also important in evaluating our models because the data is imbalanced.
We obtained our dataset from Kaggle. It contains songs from the major genres along with various metrics about them, including acousticness, energy, key, loudness, tempo, and genre. Our models use these metrics to predict whether a song falls in the top 25% of danceability ratings. To prepare the dataset for the models, we first removed variables we decided not to use, either because they were identifying values such as song name and artist or because they contained erroneous data. For example, about half of the values in the song-duration column reported a duration of -1 milliseconds. We then removed NA values and converted the tempo column to numeric. We also normalized the popularity, loudness, and tempo columns to the range 0 to 1 using a simple min-max scaler. Next, we collapsed the factors of the key column to combine sharp and natural notes of the same letter, and converted the genre and mode columns to factors. We calculated the 75th percentile of the danceability scores and created a binary variable indicating whether a song was in the top 25%, replacing the original danceability column. Finally, we split the dataset into train, tune, and test partitions for use with our models. Below are summary statistics for important variables in our data along with a table showing the final cleaned data.
# load required libraries
library(C50)
library(caret)
library(class)
library(DT)
library(data.table)
library(MLmetrics)
library(mlbench)
library(mltools)
library(ROCR)
library(randomForest)
library(tidyverse)
music_genre_data <- read_csv("music_genre.csv")
# Removing identifier variables and mismanaged data columns (~50% of the duration variable indicated a song length of -1 ms)
music_genre = music_genre_data[-c(1,2,3,7,9,16)]
# convert '?' to NA
music_genre[music_genre == "?"] <- NA
# remove NAs
music_genre <- music_genre[complete.cases(music_genre),]
music_genre$tempo = as.numeric(music_genre$tempo)
normalize = function(x){
  (x - min(x)) / (max(x) - min(x))
}
# normalize popularity, loudness and tempo columns
music_genre[c(1,7,10)] = lapply(music_genre[c(1,7,10)], normalize)
# collapse factors of key column to group sharps and naturals
music_genre$key = fct_collapse(music_genre$key,
A = c("A", "A#"),
B = c("B", "B#"),
C = c("C", "C#"),
D = c("D", "D#"),
E = c("E", "E#"),
F = c("F", "F#"),
G = c("G", "G#"))
# change "Hip-Hop" value in genre column to a usable R name
music_genre[music_genre == "Hip-Hop"] <- "HipHop"
# rename genre column from "music_genre" to "genre"
names(music_genre)[names(music_genre)=="music_genre"] = "genre"
# change mode and music_genre columns to factor
music_genre$mode = as.factor(music_genre$mode)
music_genre$genre = as.factor(music_genre$genre)
lapply(music_genre[c(5,8,12)], table)
## $key
##
## A B C D E F G
## 7387 3398 9889 6146 3379 6660 8161
##
## $mode
##
## Major Minor
## 28874 16146
##
## $genre
##
## Alternative Anime Blues Classical Country Electronic
## 4495 4497 4470 4500 4486 4466
## HipHop Jazz Rap Rock
## 4520 4521 4504 4561
The key and genre factors are well balanced within the dataset; mode skews toward Major but has ample observations in both levels.
Boxplot of Danceability:
# find the 75th percentile of danceable songs
boxplot(music_genre$danceability)
danceabilitySummary <- summary(music_genre$danceability)
danceabilitySummary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0596 0.4420 0.5680 0.5585 0.6870 0.9860
print("Songs with danceability above .687 (the 75th percentile) fall in the top 25%")
## [1] "Songs with danceability above .687 (the 75th percentile) fall in the top 25%"
music_genre$danceability = (ifelse(music_genre$danceability > danceabilitySummary[5], 1, 0))
music_genre$danceability = fct_collapse(as.factor(music_genre$danceability), "bottom75" = "0", "top25" = "1")
# split up data into train, tune and test
set.seed(3001)
part1_indexing = createDataPartition(music_genre$danceability,
times = 1,
p = 0.70,
groups=1,
list=FALSE)
train = music_genre[part1_indexing,]
tune_and_test = music_genre[-part1_indexing,]
tune_and_test_index = createDataPartition(tune_and_test$danceability,
p = .5,
list = FALSE,
times = 1)
tune = tune_and_test[tune_and_test_index,]
test = tune_and_test[-tune_and_test_index,]
# show nicely formatted table of cleaned data
datatable(music_genre)
For our KNN model, we started with 25 nearest neighbors and every variable in the dataset. This model performed well on the negative class, which, fortunately, is the class we are more interested in correctly predicting. However, at a sensitivity of roughly 46%, we would miss about one of every two songs that would be “danceable”, and that may be too low for the bars.
KNN_train = train[-c(3,5,12)]
train1h = one_hot(as.data.table(KNN_train),
                  cols = "auto",
                  sparsifyNAs = TRUE,
                  naCols = TRUE,
                  dropCols = TRUE,
                  dropUnusedLevels = TRUE)
KNN_tune = tune[-c(3,5,12)]
tune1h = one_hot(as.data.table(KNN_tune),
                 cols = "auto",
                 sparsifyNAs = TRUE,
                 naCols = TRUE,
                 dropCols = TRUE,
                 dropUnusedLevels = TRUE)
Music_25NN = knn(train = train1h,
test = tune1h,
cl = train$danceability,
k = 25,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_25NN),
                as.factor(tune$danceability),
                positive = "top25",
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4733 918
## top25 334 768
##
## Accuracy : 0.8146
## 95% CI : (0.8051, 0.8238)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4405
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4555
## Specificity : 0.9341
## Pos Pred Value : 0.6969
## Neg Pred Value : 0.8376
## Prevalence : 0.2497
## Detection Rate : 0.1137
## Detection Prevalence : 0.1632
## Balanced Accuracy : 0.6948
##
## 'Positive' Class : top25
##
#Establishing dataframe with both the prediction and probability
Music_25NN_Prob = data.frame(pred = as_factor(Music_25NN), prob = attr(Music_25NN, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_25NN_Prob$prob = ifelse(Music_25NN_Prob$pred == "bottom75", 1 - Music_25NN_Prob$prob, Music_25NN_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_25NN = as_factor(ifelse(Music_25NN_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_25NN, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.549964054636952"
We assumed popularity, energy, liveness, loudness, and tempo would best align with danceability, so we attempted to feature-engineer the KNN model by using only these variables. However, this led to major decreases in both sensitivity and accuracy.
# selecting the variables we predicted would correlate most closely with danceability (popularity, energy, liveness, loudness, and tempo)
Music_25NN_tuned = knn(train = train[c(1,4,6,7,10)],
test = tune[c(1,4,6,7,10)],
cl = train$danceability,
k = 25,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_25NN_tuned),
                as.factor(tune$danceability),
                positive = "top25",
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4641 1186
## top25 426 500
##
## Accuracy : 0.7613
## 95% CI : (0.7509, 0.7714)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : 0.01902
##
## Kappa : 0.2501
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.29656
## Specificity : 0.91593
## Pos Pred Value : 0.53996
## Neg Pred Value : 0.79646
## Prevalence : 0.24967
## Detection Rate : 0.07404
## Detection Prevalence : 0.13712
## Balanced Accuracy : 0.60624
##
## 'Positive' Class : top25
##
#Worsened accuracy (especially sensitivity)
When adjusting to 100 neighbors, specificity marginally increased; however, this likely occurred because higher values of k favor the more prevalent class.
# sensitivity drops considerably for a small improvement in specificity (likely due more to the imbalanced data than to better predictions)
Music_100NN = knn(train = train1h,
test = tune1h,
cl = train$danceability,
k = 100,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_100NN),
                as.factor(tune$danceability),
                positive = "top25",
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4783 984
## top25 284 702
##
## Accuracy : 0.8122
## 95% CI : (0.8027, 0.8215)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4183
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4164
## Specificity : 0.9440
## Pos Pred Value : 0.7120
## Neg Pred Value : 0.8294
## Prevalence : 0.2497
## Detection Rate : 0.1040
## Detection Prevalence : 0.1460
## Balanced Accuracy : 0.6802
##
## 'Positive' Class : top25
##
#Establishing dataframe with both the prediction and probability
Music_100NN_Prob = data.frame(pred = as_factor(Music_100NN), prob = attr(Music_100NN, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_100NN_Prob$prob = ifelse(Music_100NN_Prob$pred == "bottom75", 1 - Music_100NN_Prob$prob, Music_100NN_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_100NN = as_factor(ifelse(Music_100NN_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_100NN, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.520723436322532"
Music_100NN_tuned = knn(train = train[c(1,4,6,7,10)],
test = tune[c(1,4,6,7,10)],
cl = train$danceability,
k = 100,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_100NN_tuned),
                as.factor(tune$danceability),
                positive = "top25",
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4639 1187
## top25 428 499
##
## Accuracy : 0.7608
## 95% CI : (0.7505, 0.771)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : 0.02334
##
## Kappa : 0.2489
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.29597
## Specificity : 0.91553
## Pos Pred Value : 0.53830
## Neg Pred Value : 0.79626
## Prevalence : 0.24967
## Detection Rate : 0.07389
## Detection Prevalence : 0.13727
## Balanced Accuracy : 0.60575
##
## 'Positive' Class : top25
##
At only 5 neighbors, the specificity and accuracy decreased because the model did not have enough information to predict danceability well. Overall, our first model of 25 neighbors with all variables likely best fit our business question.
Music_5NN = knn(train = train1h,
test = tune1h,
cl = train$danceability,
k = 5,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_5NN),
                as.factor(tune$danceability),
                positive = "top25",
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4555 857
## top25 512 829
##
## Accuracy : 0.7973
## 95% CI : (0.7875, 0.8068)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4193
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4917
## Specificity : 0.8990
## Pos Pred Value : 0.6182
## Neg Pred Value : 0.8416
## Prevalence : 0.2497
## Detection Rate : 0.1228
## Detection Prevalence : 0.1986
## Balanced Accuracy : 0.6953
##
## 'Positive' Class : top25
##
#Slightly increased sensitivity, but not enough and at the expense of accuracy
#Establishing dataframe with both the prediction and probability
Music_5NN_Prob = data.frame(pred = as_factor(Music_5NN), prob = attr(Music_5NN, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_5NN_Prob$prob = ifelse(Music_5NN_Prob$pred == "bottom75", 1 - Music_5NN_Prob$prob, Music_5NN_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_5NN = as_factor(ifelse(Music_5NN_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_5NN, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.548108825481088"
Music_5NN_tuned = knn(train = train[c(1,4,6,7,10)],
test = tune[c(1,4,6,7,10)],
cl = train$danceability,
k = 5,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_5NN_tuned),
                as.factor(tune$danceability),
                positive = "top25",
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4423 1056
## top25 644 630
##
## Accuracy : 0.7483
## 95% CI : (0.7377, 0.7586)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : 0.659
##
## Kappa : 0.2685
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.37367
## Specificity : 0.87290
## Pos Pred Value : 0.49451
## Neg Pred Value : 0.80726
## Prevalence : 0.24967
## Detection Rate : 0.09329
## Detection Prevalence : 0.18866
## Balanced Accuracy : 0.62328
##
## 'Positive' Class : top25
##
# the best model is likely the first one made (the increases in specificity were not worth the decreases in sensitivity)
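Rather than trying a handful of k values one at a time, the search could be automated. Below is a minimal sketch (not the code we ran) that sweeps several k values and records tune-set specificity, assuming the `train1h`, `tune1h`, `train`, and `tune` objects created earlier:

```r
# Hypothetical sketch: sweep several k values and record tune-set specificity.
# Assumes train1h, tune1h, train and tune exist as created above;
# the grid of k values is an illustrative assumption.
library(class)
library(caret)

k_values <- c(5, 15, 25, 50, 100)
spec_by_k <- sapply(k_values, function(k) {
  pred <- knn(train = train1h,
              test = tune1h,
              cl = train$danceability,
              k = k)
  cm <- confusionMatrix(as.factor(pred),
                        as.factor(tune$danceability),
                        positive = "top25")
  cm$byClass["Specificity"]
})
data.frame(k = k_values, specificity = spec_by_k)
```

This makes the specificity trade-off across k visible in a single table rather than across separate confusion matrices.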
For our decision tree, we first cross-validated a C5.0 model to find the ideal number of boosting iterations, along with whether or not it should winnow (remove) variables of low importance. Ultimately, we used 20 boosting iterations and no winnowing, but this process was very computationally expensive.
Our initial model performed better than the KNN model, but we can still improve through tuning. We adjusted two hyperparameters: (1) the minimum number of cases in each leaf node, and (2) the confidence factor (the threshold of error allowed in the data; the higher the number, the less pruning in the model).
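The cross-validation over boosting iterations and winnowing described above is not shown in this report; a minimal sketch of how it might look with caret, assuming the same `train` partition, is below (the grid values are illustrative assumptions, not the exact ones searched):

```r
# Hypothetical sketch of the cross-validation step (not the exact code used):
# caret::train searches over boosting trials and winnowing for a C5.0 tree.
library(caret)
library(C50)

set.seed(3001)
cv_grid <- expand.grid(trials = c(1, 5, 10, 20),  # boosting iterations to try
                       model = "tree",
                       winnow = c(TRUE, FALSE))   # drop low-importance predictors?
c5_cv <- train(danceability ~ .,
               data = train,
               method = "C5.0",
               metric = "Kappa",                  # better than accuracy for imbalanced data
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = cv_grid)
c5_cv$bestTune
```

Because every fold refits up to 20 boosted trees, this search is what makes the step computationally expensive.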
c5_model = C5.0(danceability ~ .,
                data = train,
                trials = 20,
                control = C5.0Control(winnow = FALSE,
                                      minCases = 500))
varImp(c5_model)
## Overall
## loudness 100.00
## tempo 100.00
## genre 100.00
## valence 98.29
## acousticness 92.05
## energy 89.98
## liveness 89.98
## speechiness 89.44
## popularity 48.89
## mode 36.47
## key 15.58
plot(c5_model)
The first split is on genre; intuitively, this makes sense given the splits that follow: rap and hip-hop are most danceable when they are high-energy and fast-tempo, while other genres may be better suited to slow dancing and therefore work best with low energy and slow tempo.
dance_prob = as_tibble(predict(c5_model, tune, type = "prob"))
dance_pred = predict(c5_model, tune, type = "class")
confusionMatrix(as.factor(dance_pred),
as.factor(tune$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4781 769
## top25 286 917
##
## Accuracy : 0.8438
## 95% CI : (0.8349, 0.8524)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.539
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5439
## Specificity : 0.9436
## Pos Pred Value : 0.7623
## Neg Pred Value : 0.8614
## Prevalence : 0.2497
## Detection Rate : 0.1358
## Detection Prevalence : 0.1781
## Balanced Accuracy : 0.7437
##
## 'Positive' Class : top25
##
#F1 Score at .5 threshold
pred_5 = as_factor(ifelse(dance_prob$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.634821737625476"
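The `empty` data frame plotted below comes from a minCases tuning loop whose code is not shown in this report; a hypothetical reconstruction (the grid values are assumptions) might look like:

```r
# Hypothetical reconstruction of the minCases sweep that fills `empty`;
# the exact loop and grid used are not shown in the report.
min_cases_grid <- c(10, 25, 50, 100, 250, 500)
empty <- data.frame(min_cases = min_cases_grid, F1 = NA, specificity = NA)
for (i in seq_along(min_cases_grid)) {
  mdl <- C5.0(danceability ~ .,
              data = train,
              trials = 20,
              control = C5.0Control(winnow = FALSE,
                                    minCases = min_cases_grid[i]))
  pred <- predict(mdl, tune, type = "class")
  cm <- confusionMatrix(as.factor(pred),
                        as.factor(tune$danceability),
                        positive = "top25")
  empty$F1[i] <- F1_Score(y_pred = pred,
                          y_true = tune$danceability,
                          positive = "top25")
  empty$specificity[i] <- cm$byClass["Specificity"]
}
```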
ggplot(empty, aes(x = reorder(as.factor(min_cases), -F1), y = F1)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty, min_cases == 50),
           aes(as.factor(min_cases), F1),
           fill = "green", stat = "identity")
ggplot(empty, aes(x = reorder(as.factor(min_cases), -specificity), y = specificity)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty, min_cases == 50),
           aes(as.factor(min_cases), specificity),
           fill = "green", stat = "identity")
#50 seems to be the best compromise
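Similarly, the `empty2` data frame plotted below is assumed to come from a sweep over the confidence factor; a hypothetical reconstruction (grid values are assumptions) is:

```r
# Hypothetical reconstruction of the confidence-factor sweep behind `empty2`;
# the exact loop and grid used are not shown in the report.
cf_grid <- c(.1, .25, .5, .75, .9)
empty2 <- data.frame(CF_level = cf_grid, F1 = NA, specificity = NA)
for (i in seq_along(cf_grid)) {
  mdl <- C5.0(danceability ~ .,
              data = train,
              trials = 20,
              control = C5.0Control(winnow = FALSE,
                                    minCases = 50,
                                    CF = cf_grid[i]))
  pred <- predict(mdl, tune, type = "class")
  cm <- confusionMatrix(as.factor(pred),
                        as.factor(tune$danceability),
                        positive = "top25")
  empty2$F1[i] <- F1_Score(y_pred = pred,
                           y_true = tune$danceability,
                           positive = "top25")
  empty2$specificity[i] <- cm$byClass["Specificity"]
}
```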
ggplot(empty2, aes(x = reorder(as.factor(CF_level), -F1), y = F1)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty2, CF_level == .9),
           aes(as.factor(CF_level), F1),
           fill = "green", stat = "identity")
ggplot(empty2, aes(x = reorder(as.factor(CF_level), -specificity), y = specificity)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty2, CF_level == .9),
           aes(as.factor(CF_level), specificity),
           fill = "green", stat = "identity")
We found a minimum of 50 cases per leaf and a confidence factor of .9 to be the best compromise between F1 score and specificity. The final model marginally decreased in specificity but substantially increased in sensitivity and F1.
c5_model_tune = C5.0(danceability ~ .,
                     data = train,
                     trials = 20,
                     control = C5.0Control(winnow = FALSE,
                                           minCases = 50,
                                           CF = .9))
dance_prob_tune = as_tibble(predict(c5_model_tune, tune, type = "prob"))
dance_pred_tune = predict(c5_model_tune, tune, type = "class")
confusionMatrix(as.factor(dance_pred_tune),
as.factor(tune$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4720 631
## top25 347 1055
##
## Accuracy : 0.8552
## 95% CI : (0.8466, 0.8635)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5904
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6257
## Specificity : 0.9315
## Pos Pred Value : 0.7525
## Neg Pred Value : 0.8821
## Prevalence : 0.2497
## Detection Rate : 0.1562
## Detection Prevalence : 0.2076
## Balanced Accuracy : 0.7786
##
## 'Positive' Class : top25
##
#F1 Score at .5 threshold
pred_5_tune = as_factor(ifelse(dance_prob_tune$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_tune, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.683290155440415"
# best model so far; plot an ROC curve to find an optimal threshold
dance_prob_tune = as_tibble(dance_prob_tune)
dance_eval_tune = tibble(pred_class=dance_pred_tune, pred_prob=dance_prob_tune$top25,target=as.numeric(tune$danceability))
pred = prediction(dance_eval_tune$pred_prob, dance_eval_tune$target)
#Choosing to evaluate the true positive rate and false positive rate based on the threshold
ROC_curve = performance(pred,"tpr","fpr")
plot(ROC_curve, colorize=TRUE)
abline(a=0, b= 1)
tree_perf_AUC = performance(pred,"auc")
print(paste("AUC =",tree_perf_AUC@y.values))
## [1] "AUC = 0.906414075118206"
# .5 is near the elbow of the curve, but .4 and .6 are worth trying
dance_pred_4 = as_factor(ifelse(dance_prob_tune$top25 > 0.4, "top25", "bottom75"))
confusionMatrix(as.factor(dance_pred_4),
as.factor(tune$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4592 565
## top25 475 1121
##
## Accuracy : 0.846
## 95% CI : (0.8372, 0.8545)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5815
##
## Mcnemar's Test P-Value : 0.005784
##
## Sensitivity : 0.6649
## Specificity : 0.9063
## Pos Pred Value : 0.7024
## Neg Pred Value : 0.8904
## Prevalence : 0.2497
## Detection Rate : 0.1660
## Detection Prevalence : 0.2363
## Balanced Accuracy : 0.7856
##
## 'Positive' Class : top25
##
The increases in sensitivity are not worth the decreases in specificity, since specificity is our metric of interest.
dance_pred_6 = as_factor(ifelse(dance_prob_tune$top25 > 0.6, "top25", "bottom75"))
confusionMatrix(as.factor(dance_pred_6),
as.factor(tune$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4897 987
## top25 170 699
##
## Accuracy : 0.8287
## 95% CI : (0.8195, 0.8376)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4545
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4146
## Specificity : 0.9664
## Pos Pred Value : 0.8044
## Neg Pred Value : 0.8323
## Prevalence : 0.2497
## Detection Rate : 0.1035
## Detection Prevalence : 0.1287
## Balanced Accuracy : 0.6905
##
## 'Positive' Class : top25
##
The improvements in specificity are not worth the great losses in sensitivity.
# function to calculate the mtry level
mytry_tune <- function(x){
  xx <- dim(x)[2] - 1
  sqrt(xx)
}
set.seed(3001)
rfInit = randomForest(danceability ~ .,   #<- response ~ all other variables in the data
                      train,              #<- data frame with the variables to be used
                      ntree = 500,        #<- number of trees to grow; should be large enough that every input row is classified several times
                      mtry = mytry_tune(music_genre), #<- variables sampled as candidates at each split; default for classification is sqrt(# of variables)
                      replace = TRUE,     #<- sample data points with replacement
                      sampsize = 3000,    #<- size of sample to draw for each tree
                      nodesize = 5,       #<- minimum number of data points in terminal nodes
                      importance = TRUE,  #<- assess importance of predictors
                      proximity = FALSE,  #<- skip proximity measure between rows
                      norm.votes = TRUE,  #<- express final votes as fractions
                      do.trace = TRUE,    #<- verbose output while running
                      keep.forest = TRUE, #<- retain the forest in the output object
                      keep.inbag = TRUE)  #<- track which samples are in-bag in which trees
# function to show call, variable importance and confusion matrix given a model
showModelOutput <- function(mdl, modelName) {
print("Call")
print(mdl$call)
print('Variable Importance')
print(mdl$importance)
varImpPlot(mdl, main=modelName)
plot(mdl, main=modelName)
mdlPredict = predict(mdl,
tune,
type = "response",
predict.all = FALSE,
proximity = FALSE)
confusionMatrix(as.factor(mdlPredict),
as.factor(tune$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
}
# function to output an F-1 score given a model
f1score = function(mdl){
mdlPredictprob = as_tibble(predict(mdl,
tune,
type = "prob",
predict.all = FALSE,
proximity = FALSE))
pred_5_mdl = as_factor(ifelse(mdlPredictprob$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_mdl, y_true = as_factor(tune$danceability), positive = "top25")))
}
# function to run the model with a given number of trees and randomly sampled variables
tuneModel <- function(numTrees, mTry) {
set.seed(3001)
  rf = randomForest(danceability ~ .,   #<- response ~ all other variables in the data
                    train,              #<- data frame with the variables to be used
                    ntree = numTrees,   #<- number of trees to grow
                    mtry = mTry,        #<- variables sampled as candidates at each split
                    replace = TRUE,     #<- sample data points with replacement
                    sampsize = 3000,    #<- size of sample to draw for each tree
                    nodesize = 5,       #<- minimum number of data points in terminal nodes
                    importance = TRUE,  #<- assess importance of predictors
                    proximity = FALSE,  #<- skip proximity measure between rows
                    norm.votes = TRUE,  #<- express final votes as fractions
                    do.trace = TRUE,    #<- verbose output while running
                    keep.forest = TRUE, #<- retain the forest in the output object
                    keep.inbag = TRUE)  #<- track which samples are in-bag in which trees
return(rf)
}
rf1 = tuneModel(250, 3)
rf2 = tuneModel(750, 3)
rf3 = tuneModel(1000, 3)
rf4 = tuneModel(250, 4)
rf5 = tuneModel(500, 4)
rf6 = tuneModel(750, 4)
rf7 = tuneModel(1000, 4)
showModelOutput(rfInit, 'Initial Model')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = 500,
## mtry = mytry_tune(music_genre), replace = TRUE, sampsize = 3000,
## nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0028082400 0.0297349585 0.0095300355 66.049461
## acousticness 0.0221714834 0.0268983771 0.0233517180 87.225350
## energy 0.0439789286 0.0198807787 0.0379645350 99.501703
## key 0.0003181391 0.0026313539 0.0008954235 52.845671
## liveness 0.0027286920 0.0067857634 0.0037413657 68.515600
## loudness 0.0194395663 0.0054037370 0.0159367254 67.903111
## mode 0.0007975935 0.0002112042 0.0006512238 7.134999
## speechiness 0.0265565318 0.0458372893 0.0313685228 127.479963
## tempo 0.0084609073 0.0398508084 0.0162957907 107.873982
## valence 0.0126543110 0.0463523295 0.0210648290 115.383466
## genre 0.0346956844 0.1146131197 0.0546443362 176.929165
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4795 747
## top25 272 939
##
## Accuracy : 0.8491
## 95% CI : (0.8403, 0.8576)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5555
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5569
## Specificity : 0.9463
## Pos Pred Value : 0.7754
## Neg Pred Value : 0.8652
## Prevalence : 0.2497
## Detection Rate : 0.1390
## Detection Prevalence : 0.1793
## Balanced Accuracy : 0.7516
##
## 'Positive' Class : top25
##
f1score(rfInit)
## [1] "F-1 Score at a .5 threshold: 0.646855563234278"
showModelOutput(rf1, '250 Trees with 3 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0028034922 0.0286603324 0.0092590448 65.514197
## acousticness 0.0224116992 0.0252892875 0.0231300479 85.598486
## energy 0.0466930599 0.0195978548 0.0399319590 101.485561
## key 0.0002951075 0.0030493668 0.0009823990 53.015681
## liveness 0.0026328644 0.0065500539 0.0036106661 68.483126
## loudness 0.0202080817 0.0046957784 0.0163376597 68.494976
## mode 0.0008123317 0.0003138149 0.0006879538 7.328699
## speechiness 0.0275490084 0.0453778688 0.0319990136 128.621822
## tempo 0.0083738784 0.0401332938 0.0162998785 107.399952
## valence 0.0127894155 0.0469588975 0.0213172223 116.396327
## genre 0.0348725311 0.1138615015 0.0545869002 176.328575
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4787 748
## top25 280 938
##
## Accuracy : 0.8478
## 95% CI : (0.839, 0.8563)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5522
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5563
## Specificity : 0.9447
## Pos Pred Value : 0.7701
## Neg Pred Value : 0.8649
## Prevalence : 0.2497
## Detection Rate : 0.1389
## Detection Prevalence : 0.1804
## Balanced Accuracy : 0.7505
##
## 'Positive' Class : top25
##
f1score(rf1)
## [1] "F-1 Score at a .5 threshold: 0.645027624309392"
showModelOutput(rf2, '750 Trees with 3 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0026352312 0.0299939521 0.0094631353 65.92190
## acousticness 0.0213912856 0.0274745531 0.0229099082 87.21707
## energy 0.0425295248 0.0201312823 0.0369400841 98.92768
## key 0.0003559939 0.0025749304 0.0009096668 52.47710
## liveness 0.0028145191 0.0065896949 0.0037569159 68.65797
## loudness 0.0193361446 0.0052561346 0.0158229870 68.41699
## mode 0.0008073480 0.0002389608 0.0006654871 7.26443
## speechiness 0.0260895894 0.0457414228 0.0309939135 126.99184
## tempo 0.0085384197 0.0403734133 0.0164834476 108.57072
## valence 0.0124648080 0.0457061269 0.0207604290 115.19557
## genre 0.0343391656 0.1160565503 0.0547346378 177.76502
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4799 750
## top25 268 936
##
## Accuracy : 0.8493
## 95% CI : (0.8405, 0.8577)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5552
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5552
## Specificity : 0.9471
## Pos Pred Value : 0.7774
## Neg Pred Value : 0.8648
## Prevalence : 0.2497
## Detection Rate : 0.1386
## Detection Prevalence : 0.1783
## Balanced Accuracy : 0.7511
##
## 'Positive' Class : top25
##
f1score(rf2)
## [1] "F-1 Score at a .5 threshold: 0.647282796815507"
showModelOutput(rf3, '1000 Trees with 3 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0027166174 0.0298640409 0.0094931241 66.123537
## acousticness 0.0206330249 0.0273184314 0.0223022812 86.062525
## energy 0.0418746953 0.0204750002 0.0365341273 98.908287
## key 0.0003613395 0.0023986769 0.0008696947 52.335094
## liveness 0.0027710398 0.0066395753 0.0037367792 68.758263
## loudness 0.0190288220 0.0055507681 0.0156650561 68.474415
## mode 0.0007790151 0.0002131674 0.0006377681 7.280667
## speechiness 0.0262426278 0.0458532197 0.0311377663 127.039464
## tempo 0.0085760195 0.0404317843 0.0165273109 108.911625
## valence 0.0124620964 0.0457422752 0.0207688024 114.732622
## genre 0.0341784379 0.1162955839 0.0546763345 177.847999
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4799 747
## top25 268 939
##
## Accuracy : 0.8497
## 95% CI : (0.8409, 0.8581)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5568
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5569
## Specificity : 0.9471
## Pos Pred Value : 0.7780
## Neg Pred Value : 0.8653
## Prevalence : 0.2497
## Detection Rate : 0.1390
## Detection Prevalence : 0.1787
## Balanced Accuracy : 0.7520
##
## 'Positive' Class : top25
##
f1score(rf3)
## [1] "F-1 Score at a .5 threshold: 0.649153128240581"
showModelOutput(rf4, '250 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0021069248 0.0301827390 0.0091114684 63.460726
## acousticness 0.0215261149 0.0244784442 0.0222617471 84.910277
## energy 0.0441774110 0.0202964868 0.0382161685 102.338821
## key 0.0003943582 0.0022576506 0.0008592551 52.876098
## liveness 0.0026858278 0.0068944839 0.0037366359 70.569347
## loudness 0.0188357309 0.0043609772 0.0152247219 64.368892
## mode 0.0006677411 0.0001788687 0.0005457108 6.770831
## speechiness 0.0270498898 0.0448028211 0.0314799100 124.934020
## tempo 0.0092647960 0.0428661158 0.0176491298 114.613336
## valence 0.0143455969 0.0493261326 0.0230750192 120.600984
## genre 0.0377116565 0.1227544696 0.0589338879 188.437400
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4791 729
## top25 276 957
##
## Accuracy : 0.8512
## 95% CI : (0.8425, 0.8596)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5637
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5676
## Specificity : 0.9455
## Pos Pred Value : 0.7762
## Neg Pred Value : 0.8679
## Prevalence : 0.2497
## Detection Rate : 0.1417
## Detection Prevalence : 0.1826
## Balanced Accuracy : 0.7566
##
## 'Positive' Class : top25
##
f1score(rf4)
## [1] "F-1 Score at a .5 threshold: 0.65430827325781"
showModelOutput(rf5, '500 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0022303421 0.029081292 0.0089301939 62.970153
## acousticness 0.0209604834 0.025464504 0.0220844360 84.177256
## energy 0.0435635985 0.020111480 0.0377090409 100.916646
## key 0.0003187114 0.002511198 0.0008658181 53.691086
## liveness 0.0027421037 0.007008868 0.0038068898 70.357032
## loudness 0.0186393241 0.004205398 0.0150374037 64.791202
## mode 0.0006633483 0.000144631 0.0005338648 6.966023
## speechiness 0.0266254183 0.045246869 0.0312725506 124.242029
## tempo 0.0092047879 0.044082530 0.0179084027 116.091403
## valence 0.0142293411 0.049178236 0.0229515894 119.903127
## genre 0.0381948041 0.122052868 0.0591235032 188.648798
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4795 734
## top25 272 952
##
## Accuracy : 0.851
## 95% CI : (0.8423, 0.8594)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5624
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5647
## Specificity : 0.9463
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.8672
## Prevalence : 0.2497
## Detection Rate : 0.1410
## Detection Prevalence : 0.1813
## Balanced Accuracy : 0.7555
##
## 'Positive' Class : top25
##
f1score(rf5)
## [1] "F-1 Score at a .5 threshold: 0.653819683413627"
showModelOutput(rf6, '750 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0023475280 0.0290621781 0.0090131985 63.387407
## acousticness 0.0210153213 0.0260702110 0.0222762663 84.024737
## energy 0.0434541648 0.0209415764 0.0378348005 100.754690
## key 0.0003259417 0.0024115271 0.0008464242 53.950837
## liveness 0.0027709160 0.0068952264 0.0038000770 70.360076
## loudness 0.0188266921 0.0043728043 0.0152188948 65.291249
## mode 0.0006892206 0.0001863619 0.0005636816 6.830315
## speechiness 0.0266887768 0.0456021366 0.0314095277 124.756277
## tempo 0.0092703640 0.0439295714 0.0179196095 115.695801
## valence 0.0139625578 0.0492097970 0.0227590552 120.161151
## genre 0.0377956426 0.1213235163 0.0586417271 187.920914
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4796 734
## top25 271 952
##
## Accuracy : 0.8512
## 95% CI : (0.8425, 0.8596)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5627
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5647
## Specificity : 0.9465
## Pos Pred Value : 0.7784
## Neg Pred Value : 0.8673
## Prevalence : 0.2497
## Detection Rate : 0.1410
## Detection Prevalence : 0.1811
## Balanced Accuracy : 0.7556
##
## 'Positive' Class : top25
##
f1score(rf6)
## [1] "F-1 Score at a .5 threshold: 0.654282765737874"
showModelOutput(rf7, '1000 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees,
## mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5,
## importance = TRUE, proximity = FALSE, norm.votes = TRUE,
## do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
## bottom75 top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity 0.0023803830 0.0294547451 0.0091363689 64.780576
## acousticness 0.0209007604 0.0263255993 0.0222551631 84.584563
## energy 0.0432719177 0.0206952529 0.0376365048 100.473251
## key 0.0003379178 0.0023090811 0.0008299471 53.976000
## liveness 0.0027112367 0.0067801025 0.0037266605 69.982193
## loudness 0.0189376095 0.0040407258 0.0152192093 65.452419
## mode 0.0006551678 0.0001891369 0.0005387720 6.811971
## speechiness 0.0264732573 0.0450157173 0.0311020025 124.169033
## tempo 0.0092899192 0.0440812127 0.0179728595 115.550913
## valence 0.0141687552 0.0489990978 0.0228617403 119.912000
## genre 0.0376795038 0.1204929300 0.0583484363 186.637727
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4798 728
## top25 269 958
##
## Accuracy : 0.8524
## 95% CI : (0.8437, 0.8607)
## No Information Rate : 0.7503
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5666
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5682
## Specificity : 0.9469
## Pos Pred Value : 0.7808
## Neg Pred Value : 0.8683
## Prevalence : 0.2497
## Detection Rate : 0.1419
## Detection Prevalence : 0.1817
## Balanced Accuracy : 0.7576
##
## 'Positive' Class : top25
##
f1score(rf7)
## [1] "F-1 Score at a .5 threshold: 0.657731958762887"
# Drop the columns not used by the KNN model, matching the training set
KNN_test = test[-c(3,5,12)]
# One-hot encode the remaining factor variables so KNN can compute distances
test1h = one_hot(as.data.table(KNN_test),
                 cols = "auto",
                 sparsifyNAs = TRUE,
                 naCols = TRUE,
                 dropCols = TRUE,
                 dropUnusedLevels = TRUE)
Music_25NN_Final = knn(train = train1h,
test = test1h,
cl = train$danceability,
k = 25,
use.all = TRUE,
prob = TRUE)
confusionMatrix(as.factor(Music_25NN_Final), as.factor(test$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction bottom75 top25
## bottom75 4665 917
## top25 402 768
##
## Accuracy : 0.8047
## 95% CI : (0.795, 0.814)
## No Information Rate : 0.7504
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4192
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4558
## Specificity : 0.9207
## Pos Pred Value : 0.6564
## Neg Pred Value : 0.8357
## Prevalence : 0.2496
## Detection Rate : 0.1137
## Detection Prevalence : 0.1733
## Balanced Accuracy : 0.6882
##
## 'Positive' Class : top25
##
#Establishing dataframe with both the prediction and probability
Music_25NN_Final_Prob = data.frame(pred = as_factor(Music_25NN_Final), prob = attr(Music_25NN_Final, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_25NN_Final_Prob$prob = ifelse(Music_25NN_Final_Prob$pred == "bottom75", 1 - Music_25NN_Final_Prob$prob, Music_25NN_Final_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_KNN = as_factor(ifelse(Music_25NN_Final_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_KNN, y_true = as_factor(test$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.536978618997546"
# Evaluate the tuned decision tree on the held-out test set
dance_prob_test = as_tibble(predict(c5_model_tune, test, type = "prob"))
dance_pred_test = predict(c5_model_tune, test, type = "class")
confusionMatrix(as.factor(dance_pred_test),
as.factor(test$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4654 654
## top25 413 1031
##
## Accuracy : 0.842
## 95% CI : (0.8331, 0.8506)
## No Information Rate : 0.7504
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5569
##
## Mcnemar's Test P-Value : 2.022e-13
##
## Sensitivity : 0.6119
## Specificity : 0.9185
## Pos Pred Value : 0.7140
## Neg Pred Value : 0.8768
## Prevalence : 0.2496
## Detection Rate : 0.1527
## Detection Prevalence : 0.2139
## Balanced Accuracy : 0.7652
##
## 'Positive' Class : top25
##
#F1 Score at .5 threshold
pred_5_test = as_factor(ifelse(dance_prob_test$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_test, y_true = as_factor(test$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.65899648449984"
# Evaluate the best-performing random forest (rf7) on the held-out test set
rfPredict = predict(rf7,
                    test,
                    type = "response",
                    predict.all = FALSE,
                    proximity = FALSE)
rfPredictprob = as_tibble(predict(rf7,
                                  test,
                                  type = "prob",
                                  predict.all = FALSE,
                                  proximity = FALSE))
confusionMatrix(as.factor(rfPredict),
as.factor(test$danceability),
dnn = c("Predicted", "Actual"),
mode = "sens_spec",
positive = "top25")
## Confusion Matrix and Statistics
##
## Actual
## Predicted bottom75 top25
## bottom75 4738 742
## top25 329 943
##
## Accuracy : 0.8414
## 95% CI : (0.8324, 0.85)
## No Information Rate : 0.7504
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5388
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5596
## Specificity : 0.9351
## Pos Pred Value : 0.7414
## Neg Pred Value : 0.8646
## Prevalence : 0.2496
## Detection Rate : 0.1397
## Detection Prevalence : 0.1884
## Balanced Accuracy : 0.7474
##
## 'Positive' Class : top25
##
pred_5_rf = as_factor(ifelse(rfPredictprob$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_rf, y_true = as_factor(test$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.637347767253045"
Moving forward, I would recommend the bars use our single decision tree model if they prioritize correctly identifying danceable songs, since it achieved the highest sensitivity on the test set, and the random forest if they want to avoid playing a buzzkill, since its higher specificity makes it more conservative about labeling songs as top-25%. In the future, it may also be productive to model danceability with regression methods instead of classification, because the danceability variable starts out as a continuous score before we binarize it at the 75th percentile.
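As a rough sketch of that regression idea (assuming a version of our cleaned data, here called `train_reg` and `test_reg`, that keeps the original continuous danceability score rather than the binary top25/bottom75 label — those objects are not part of the pipeline above), a regression random forest could be fit with the same package we used for classification:

```r
# Sketch only: `train_reg` / `test_reg` are hypothetical splits that retain
# danceability as a continuous score instead of the binary factor used above.
library(randomForest)

# With a numeric response, randomForest fits a regression forest automatically
rf_reg = randomForest(danceability ~ .,
                      data = train_reg,
                      ntree = 500,
                      mtry = 4,
                      importance = TRUE)

# Predicted danceability scores on the test set; songs could then be ranked
# directly for playlist-building instead of thresholded at the 75th percentile
reg_pred = predict(rf_reg, test_reg, type = "response")
```

One advantage of this framing is that the bars could sort candidate songs by predicted score and take the top of the ranking, rather than depending on a fixed percentile cutoff chosen in advance.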